[Bugfix] Fix IndexError in Qwen3CoderToolParser streaming #34495

Open
R3hankhan123 wants to merge 1 commit into vllm-project:main from R3hankhan123:qwen-code-parser-fix

Conversation

Contributor

@R3hankhan123 R3hankhan123 commented Feb 13, 2026

Purpose

When streaming tool calls, the parser attempted to access arguments[func_start:] before validating that func_start is within bounds. This caused an IndexError when the function name appeared at the very end of the streamed chunk.

Fixes #34322

Test Plan

  1. Build the image with this change applied
  2. Send an inference request that uses tool calling

Test Result

Server logs

root@openshiftai-vllm:~/vllm# docker run   --gpus all   --ipc=host   --shm-size=10g   -p 8000:8000   vllm:qwenfix  Qwen/Qwen3-Coder-Next-FP8   --tensor-parallel-size 2   --max-model-len 2048   --served-model-name Qwen3-Coder-Next   --enable-auto-tool-choice   --tool-call-parser qwen3_coder   --max-num-batched-tokens 512   --gpu-memory-utilization 0.98   --disable-custom-all-reduce --enforce-eager
(APIServer pid=1) INFO 02-13 06:05:31 [utils.py:287] 
(APIServer pid=1) INFO 02-13 06:05:31 [utils.py:287]        █     █     █▄   ▄█
(APIServer pid=1) INFO 02-13 06:05:31 [utils.py:287]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.1.dev13827+g16341f4bb.d20260213
(APIServer pid=1) INFO 02-13 06:05:31 [utils.py:287]   █▄█▀ █     █     █     █  model   Qwen/Qwen3-Coder-Next-FP8
(APIServer pid=1) INFO 02-13 06:05:31 [utils.py:287]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=1) INFO 02-13 06:05:31 [utils.py:287] 
(APIServer pid=1) INFO 02-13 06:05:31 [utils.py:223] non-default args: {'model_tag': 'Qwen/Qwen3-Coder-Next-FP8', 'enable_auto_tool_choice': True, 'tool_call_parser': 'qwen3_coder', 'model': 'Qwen/Qwen3-Coder-Next-FP8', 'max_model_len': 2048, 'enforce_eager': True, 'served_model_name': ['Qwen3-Coder-Next'], 'tensor_parallel_size': 2, 'disable_custom_all_reduce': True, 'gpu_memory_utilization': 0.98, 'max_num_batched_tokens': 512}
(APIServer pid=1) INFO 02-13 06:05:44 [model.py:531] Resolved architecture: Qwen3NextForCausalLM
(APIServer pid=1) INFO 02-13 06:05:44 [model.py:1555] Using max model len 2048
(APIServer pid=1) INFO 02-13 06:05:44 [scheduler.py:224] Chunked prefill is enabled with max_num_batched_tokens=512.
(APIServer pid=1) INFO 02-13 06:05:44 [config.py:504] Setting attention block size to 544 tokens to ensure that attention page size is >= mamba page size.
(APIServer pid=1) INFO 02-13 06:05:44 [config.py:535] Padding mamba page size by 1.49% to ensure that mamba page size and attention page size are exactly equal.
(APIServer pid=1) INFO 02-13 06:05:44 [vllm.py:698] Asynchronous scheduling is enabled.
(APIServer pid=1) WARNING 02-13 06:05:44 [vllm.py:736] Enforce eager set, overriding optimization level to -O0
(APIServer pid=1) INFO 02-13 06:05:44 [vllm.py:854] Cudagraph is disabled under eager mode
(EngineCore_DP0 pid=221) INFO 02-13 06:05:53 [core.py:97] Initializing a V1 LLM engine (v0.1.dev13827+g16341f4bb.d20260213) with config: model='Qwen/Qwen3-Coder-Next-FP8', speculative_config=None, tokenizer='Qwen/Qwen3-Coder-Next-FP8', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=2, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=True, quantization=fp8, enforce_eager=True, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=Qwen3-Coder-Next, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.NONE: 0>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['+quant_fp8', 'all', '+quant_fp8'], 'splitting_ops': [], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [512], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': 
True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False, 'fuse_act_padding': False}, 'max_cudagraph_capture_size': 0, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}
(EngineCore_DP0 pid=221) WARNING 02-13 06:05:53 [multiproc_executor.py:921] Reducing Torch parallelism from 24 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 02-13 06:05:58 [parallel_state.py:1246] world_size=2 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:40837 backend=nccl
INFO 02-13 06:05:58 [parallel_state.py:1246] world_size=2 rank=1 local_rank=1 distributed_init_method=tcp://127.0.0.1:40837 backend=nccl
INFO 02-13 06:05:58 [pynccl.py:111] vLLM is using nccl==2.27.5
WARNING 02-13 06:05:58 [symm_mem.py:67] SymmMemCommunicator: Device capability 8.9 not supported, communicator is not available.
WARNING 02-13 06:05:58 [symm_mem.py:67] SymmMemCommunicator: Device capability 8.9 not supported, communicator is not available.
INFO 02-13 06:05:58 [parallel_state.py:1474] rank 0 in world size 2 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0, EPLB rank N/A
INFO 02-13 06:05:58 [parallel_state.py:1474] rank 1 in world size 2 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 1, EP rank 1, EPLB rank N/A
(Worker_TP0 pid=323) INFO 02-13 06:05:59 [gpu_model_runner.py:4124] Starting to load model Qwen/Qwen3-Coder-Next-FP8...
(Worker_TP0 pid=323) INFO 02-13 06:05:59 [fp8.py:338] Using TRITON Fp8 MoE backend out of potential backends: ['AITER', 'FLASHINFER_TRTLLM', 'FLASHINFER_CUTLASS', 'DEEPGEMM', 'BATCHED_DEEPGEMM', 'TRITON', 'BATCHED_TRITON', 'MARLIN', 'XPU'].
(Worker_TP0 pid=323) INFO 02-13 06:06:00 [cuda.py:367] Using FLASH_ATTN attention backend out of potential backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION'].
(Worker_TP1 pid=324) INFO 02-13 06:20:02 [weight_utils.py:539] Time spent downloading weights for Qwen/Qwen3-Coder-Next-FP8: 839.750869 seconds
(Worker_TP0 pid=323) INFO 02-13 06:20:07 [weight_utils.py:539] Time spent downloading weights for Qwen/Qwen3-Coder-Next-FP8: 3.747348 seconds
Loading safetensors checkpoint shards:   0% Completed | 0/40 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:   2% Completed | 1/40 [00:10<06:44, 10.38s/it]

Loading safetensors checkpoint shards:   5% Completed | 2/40 [04:18<1:35:14, 150.38s/it]
Loading safetensors checkpoint shards:   8% Completed | 3/40 [06:47<1:32:20, 149.75s/it]
Loading safetensors checkpoint shards:  10% Completed | 4/40 [08:41<1:21:21, 135.61s/it]
Loading safetensors checkpoint shards:  12% Completed | 5/40 [08:42<50:45, 87.00s/it]
Loading safetensors checkpoint shards:  15% Completed | 6/40 [08:43<32:41, 57.68s/it]
Loading safetensors checkpoint shards:  18% Completed | 7/40 [08:43<21:28, 39.06s/it]
Loading safetensors checkpoint shards:  20% Completed | 8/40 [08:44<14:19, 26.87s/it]
Loading safetensors checkpoint shards:  22% Completed | 9/40 [08:45<09:39, 18.70s/it]
Loading safetensors checkpoint shards:  25% Completed | 10/40 [08:46<06:35, 13.17s/it]
Loading safetensors checkpoint shards:  28% Completed | 11/40 [08:48<04:47,  9.90s/it]
Loading safetensors checkpoint shards:  30% Completed | 12/40 [08:49<03:19,  7.12s/it]
Loading safetensors checkpoint shards:  32% Completed | 13/40 [08:50<02:20,  5.20s/it]
Loading safetensors checkpoint shards:  35% Completed | 14/40 [08:51<01:40,  3.88s/it]
Loading safetensors checkpoint shards:  38% Completed | 15/40 [08:52<01:14,  2.97s/it]
Loading safetensors checkpoint shards:  40% Completed | 16/40 [08:52<00:55,  2.30s/it]
Loading safetensors checkpoint shards:  42% Completed | 17/40 [08:54<00:46,  2.01s/it]
Loading safetensors checkpoint shards:  45% Completed | 18/40 [08:54<00:35,  1.63s/it]
Loading safetensors checkpoint shards:  48% Completed | 19/40 [08:55<00:28,  1.38s/it]
Loading safetensors checkpoint shards:  50% Completed | 20/40 [08:56<00:24,  1.20s/it]
Loading safetensors checkpoint shards:  52% Completed | 21/40 [08:57<00:20,  1.07s/it]
Loading safetensors checkpoint shards:  55% Completed | 22/40 [08:59<00:23,  1.30s/it]
Loading safetensors checkpoint shards:  57% Completed | 23/40 [08:59<00:19,  1.15s/it]
Loading safetensors checkpoint shards:  60% Completed | 24/40 [09:00<00:17,  1.06s/it]
Loading safetensors checkpoint shards:  62% Completed | 25/40 [09:01<00:14,  1.03it/s]
Loading safetensors checkpoint shards:  65% Completed | 26/40 [09:02<00:12,  1.10it/s]
Loading safetensors checkpoint shards:  68% Completed | 27/40 [09:02<00:11,  1.15it/s]
Loading safetensors checkpoint shards:  70% Completed | 28/40 [09:04<00:11,  1.05it/s]
Loading safetensors checkpoint shards:  72% Completed | 29/40 [09:04<00:09,  1.10it/s]
Loading safetensors checkpoint shards:  75% Completed | 30/40 [09:05<00:08,  1.18it/s]
Loading safetensors checkpoint shards:  78% Completed | 31/40 [09:06<00:07,  1.22it/s]
Loading safetensors checkpoint shards:  80% Completed | 32/40 [09:07<00:06,  1.25it/s]
Loading safetensors checkpoint shards:  82% Completed | 33/40 [09:09<00:08,  1.25s/it]
Loading safetensors checkpoint shards:  85% Completed | 34/40 [09:10<00:06,  1.13s/it]
Loading safetensors checkpoint shards:  88% Completed | 35/40 [09:11<00:05,  1.02s/it]
Loading safetensors checkpoint shards:  90% Completed | 36/40 [09:11<00:03,  1.05it/s]
Loading safetensors checkpoint shards:  92% Completed | 37/40 [09:12<00:02,  1.11it/s]
Loading safetensors checkpoint shards:  95% Completed | 38/40 [09:14<00:02,  1.30s/it]
Loading safetensors checkpoint shards:  98% Completed | 39/40 [09:15<00:01,  1.14s/it]
Loading safetensors checkpoint shards: 100% Completed | 40/40 [09:16<00:00,  1.01it/s]
Loading safetensors checkpoint shards: 100% Completed | 40/40 [09:16<00:00, 13.91s/it]
(Worker_TP0 pid=323) 
(Worker_TP0 pid=323) INFO 02-13 06:29:24 [default_loader.py:293] Loading weights took 556.65 seconds
(Worker_TP0 pid=323) INFO 02-13 06:29:24 [fp8.py:495] Using MoEPrepareAndFinalizeNoEP
(Worker_TP0 pid=323) INFO 02-13 06:29:24 [gpu_model_runner.py:4221] Model loading took 37.5 GiB memory and 1404.631972 seconds
(EngineCore_DP0 pid=221) INFO 02-13 06:30:25 [shm_broadcast.py:548] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(Worker_TP0 pid=323) WARNING 02-13 06:30:50 [fp8_utils.py:1165] Using default W8A8 Block FP8 kernel config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/utils/configs/N=6144,K=2048,device_name=NVIDIA_L40S,dtype=fp8_w8a8,block_shape=[128,128].json
(Worker_TP1 pid=324) WARNING 02-13 06:30:51 [fp8_utils.py:1165] Using default W8A8 Block FP8 kernel config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/utils/configs/N=6144,K=2048,device_name=NVIDIA_L40S,dtype=fp8_w8a8,block_shape=[128,128].json
(Worker_TP0 pid=323) WARNING 02-13 06:30:56 [fp8_utils.py:1165] Using default W8A8 Block FP8 kernel config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/utils/configs/N=2048,K=2048,device_name=NVIDIA_L40S,dtype=fp8_w8a8,block_shape=[128,128].json
(Worker_TP1 pid=324) WARNING 02-13 06:30:57 [fp8_utils.py:1165] Using default W8A8 Block FP8 kernel config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/utils/configs/N=2048,K=2048,device_name=NVIDIA_L40S,dtype=fp8_w8a8,block_shape=[128,128].json
(Worker_TP0 pid=323) WARNING 02-13 06:30:57 [fp8_utils.py:1165] Using default W8A8 Block FP8 kernel config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/utils/configs/N=512,K=2048,device_name=NVIDIA_L40S,dtype=fp8_w8a8,block_shape=[128,128].json
(Worker_TP1 pid=324) WARNING 02-13 06:30:58 [fp8_utils.py:1165] Using default W8A8 Block FP8 kernel config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/utils/configs/N=512,K=2048,device_name=NVIDIA_L40S,dtype=fp8_w8a8,block_shape=[128,128].json
(Worker_TP1 pid=324) WARNING 02-13 06:30:58 [fp8_utils.py:1165] Using default W8A8 Block FP8 kernel config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/utils/configs/N=2048,K=256,device_name=NVIDIA_L40S,dtype=fp8_w8a8,block_shape=[128,128].json
(Worker_TP0 pid=323) WARNING 02-13 06:30:58 [fp8_utils.py:1165] Using default W8A8 Block FP8 kernel config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/utils/configs/N=2048,K=256,device_name=NVIDIA_L40S,dtype=fp8_w8a8,block_shape=[128,128].json
(Worker_TP0 pid=323) WARNING 02-13 06:31:03 [fused_moe.py:1086] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=512,N=256,device_name=NVIDIA_L40S,dtype=fp8_w8a8,block_shape=[128,128].json
(Worker_TP1 pid=324) WARNING 02-13 06:31:16 [fp8_utils.py:1165] Using default W8A8 Block FP8 kernel config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/utils/configs/N=4608,K=2048,device_name=NVIDIA_L40S,dtype=fp8_w8a8,block_shape=[128,128].json
(Worker_TP0 pid=323) WARNING 02-13 06:31:16 [fp8_utils.py:1165] Using default W8A8 Block FP8 kernel config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/utils/configs/N=4608,K=2048,device_name=NVIDIA_L40S,dtype=fp8_w8a8,block_shape=[128,128].json
(EngineCore_DP0 pid=221) INFO 02-13 06:31:25 [shm_broadcast.py:548] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(Worker_TP0 pid=323) INFO 02-13 06:31:32 [gpu_worker.py:375] Available KV cache memory: 4.01 GiB
(EngineCore_DP0 pid=221) INFO 02-13 06:31:32 [kv_cache_utils.py:1308] GPU KV cache size: 87,584 tokens
(EngineCore_DP0 pid=221) INFO 02-13 06:31:32 [kv_cache_utils.py:1313] Maximum concurrency for 2,048 tokens per request: 92.00x
(Worker_TP0 pid=323) INFO 02-13 06:31:32 [kernel_warmup.py:44] Skipping FlashInfer autotune because it is disabled.
(Worker_TP1 pid=324) INFO 02-13 06:31:32 [kernel_warmup.py:44] Skipping FlashInfer autotune because it is disabled.
(EngineCore_DP0 pid=221) INFO 02-13 06:31:34 [core.py:278] init engine (profile, create kv cache, warmup model) took 129.21 seconds
(EngineCore_DP0 pid=221) INFO 02-13 06:31:44 [vllm.py:698] Asynchronous scheduling is enabled.
(EngineCore_DP0 pid=221) WARNING 02-13 06:31:44 [vllm.py:743] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(EngineCore_DP0 pid=221) INFO 02-13 06:31:44 [vllm.py:854] Cudagraph is disabled under eager mode
(APIServer pid=1) INFO 02-13 06:31:44 [api_server.py:495] Supported tasks: ['generate']
(APIServer pid=1) INFO 02-13 06:32:01 [parser_manager.py:202] "auto" tool choice has been enabled.
(APIServer pid=1) WARNING 02-13 06:32:02 [model.py:1356] Default vLLM sampling parameters have been overridden by the model's `generation_config.json`: `{'top_k': 40, 'top_p': 0.95}`. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
(APIServer pid=1) INFO 02-13 06:32:02 [parser_manager.py:202] "auto" tool choice has been enabled.
(APIServer pid=1) INFO 02-13 06:32:02 [serving.py:188] Warming up chat template processing...
(APIServer pid=1) INFO 02-13 06:32:06 [hf.py:318] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
(APIServer pid=1) INFO 02-13 06:32:06 [serving.py:213] Chat template warmup completed in 4028.5ms
(APIServer pid=1) INFO 02-13 06:32:06 [parser_manager.py:202] "auto" tool choice has been enabled.
(APIServer pid=1) INFO 02-13 06:32:06 [api_server.py:500] Starting vLLM API server 0 on http://0.0.0.0:8000
(APIServer pid=1) INFO 02-13 06:32:06 [launcher.py:38] Available routes are:
(APIServer pid=1) INFO 02-13 06:32:06 [launcher.py:47] Route: /openapi.json, Methods: GET, HEAD
(APIServer pid=1) INFO 02-13 06:32:06 [launcher.py:47] Route: /docs, Methods: GET, HEAD
(APIServer pid=1) INFO 02-13 06:32:06 [launcher.py:47] Route: /docs/oauth2-redirect, Methods: GET, HEAD
(APIServer pid=1) INFO 02-13 06:32:06 [launcher.py:47] Route: /redoc, Methods: GET, HEAD
(APIServer pid=1) INFO 02-13 06:32:06 [launcher.py:47] Route: /tokenize, Methods: POST
(APIServer pid=1) INFO 02-13 06:32:06 [launcher.py:47] Route: /detokenize, Methods: POST
(APIServer pid=1) INFO 02-13 06:32:06 [launcher.py:47] Route: /load, Methods: GET
(APIServer pid=1) INFO 02-13 06:32:06 [launcher.py:47] Route: /version, Methods: GET
(APIServer pid=1) INFO 02-13 06:32:06 [launcher.py:47] Route: /health, Methods: GET
(APIServer pid=1) INFO 02-13 06:32:06 [launcher.py:47] Route: /metrics, Methods: GET
(APIServer pid=1) INFO 02-13 06:32:06 [launcher.py:47] Route: /v1/models, Methods: GET
(APIServer pid=1) INFO 02-13 06:32:06 [launcher.py:47] Route: /ping, Methods: GET
(APIServer pid=1) INFO 02-13 06:32:06 [launcher.py:47] Route: /ping, Methods: POST
(APIServer pid=1) INFO 02-13 06:32:06 [launcher.py:47] Route: /invocations, Methods: POST
(APIServer pid=1) INFO 02-13 06:32:06 [launcher.py:47] Route: /v1/chat/completions, Methods: POST
(APIServer pid=1) INFO 02-13 06:32:06 [launcher.py:47] Route: /v1/chat/completions/render, Methods: POST
(APIServer pid=1) INFO 02-13 06:32:06 [launcher.py:47] Route: /v1/responses, Methods: POST
(APIServer pid=1) INFO 02-13 06:32:06 [launcher.py:47] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=1) INFO 02-13 06:32:06 [launcher.py:47] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=1) INFO 02-13 06:32:06 [launcher.py:47] Route: /v1/completions, Methods: POST
(APIServer pid=1) INFO 02-13 06:32:06 [launcher.py:47] Route: /v1/completions/render, Methods: POST
(APIServer pid=1) INFO 02-13 06:32:06 [launcher.py:47] Route: /v1/messages, Methods: POST
(APIServer pid=1) INFO 02-13 06:32:06 [launcher.py:47] Route: /inference/v1/generate, Methods: POST
(APIServer pid=1) INFO 02-13 06:32:06 [launcher.py:47] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=1) INFO 02-13 06:32:06 [launcher.py:47] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=1) INFO:     Started server process [1]
(APIServer pid=1) INFO:     Waiting for application startup.
(APIServer pid=1) INFO:     Application startup complete.
(APIServer pid=1) INFO 02-13 06:32:37 [qwen3coder_tool_parser.py:85] vLLM Successfully import tool parser Qwen3CoderToolParser !
(EngineCore_DP0 pid=221) INFO 02-13 06:33:37 [shm_broadcast.py:548] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(APIServer pid=1) INFO 02-13 06:34:35 [loggers.py:259] Engine 000: Avg prompt throughput: 35.2 tokens/s, Avg generation throughput: 0.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.6%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 02-13 06:34:45 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.6%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 02-13 06:34:55 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 5.5 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.6%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 02-13 06:34:57 [qwen3coder_tool_parser.py:85] vLLM Successfully import tool parser Qwen3CoderToolParser !
(APIServer pid=1) INFO:     172.17.0.1:35340 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1) INFO 02-13 06:35:05 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1.5 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 02-13 06:35:15 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%

(APIServer pid=1) INFO 02-13 06:35:55 [loggers.py:259] Engine 000: Avg prompt throughput: 14.8 tokens/s, Avg generation throughput: 0.4 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.6%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO:     172.17.0.1:40932 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1) INFO 02-13 06:36:05 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 6.3 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 02-13 06:36:15 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%

Curl request

root@openshiftai-vllm:~# curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3-Coder-Next",
    "messages": [
      {
        "role": "user",
        "content": "What is the weather like in San Francisco and Paris?"
      }
    ],
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "get_current_weather",
          "description": "Get the current weather in a given location",
          "parameters": {
            "type": "object",
            "properties": {
              "location": {
                "type": "string",
                "description": "The city and state, e.g. San Francisco, CA"
              },
              "unit": {
                "type": "string",
                "enum": ["celsius", "fahrenheit"],
                "description": "The temperature unit to use"
              }
            },
            "required": ["location"]
          }
        }
      }
    ],
    "tool_choice": "auto"
  }' | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1824  100   921  100   903      6      6  0:02:33  0:02:20  0:00:13   192
{
  "id": "chatcmpl-944d8a5333c19074",
  "object": "chat.completion",
  "created": 1770964357,
  "model": "Qwen3-Coder-Next",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": null,
        "refusal": null,
        "annotations": null,
        "audio": null,
        "function_call": null,
        "tool_calls": [
          {
            "id": "chatcmpl-tool-b43c4c23c8debc77",
            "type": "function",
            "function": {
              "name": "get_current_weather",
              "arguments": "{\"location\": \"San Francisco, CA\", \"unit\": \"fahrenheit\"}"
            }
          },
          {
            "id": "chatcmpl-tool-97b23e5873ddd6c9",
            "type": "function",
            "function": {
              "name": "get_current_weather",
              "arguments": "{\"location\": \"Paris\", \"unit\": \"celsius\"}"
            }
          }
        ],
        "reasoning": null
      },
      "logprobs": null,
      "finish_reason": "tool_calls",
      "stop_reason": null,
      "token_ids": null
    }
  ],
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "prompt_tokens": 352,
    "total_tokens": 423,
    "completion_tokens": 71,
    "prompt_tokens_details": null
  },
  "prompt_logprobs": null,
  "prompt_token_ids": null,
  "kv_transfer_params": null
}
root@openshiftai-vllm:~# 
root@openshiftai-vllm:~# 
root@openshiftai-vllm:~# # Send tool results back to continue the conversation
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3-Coder-Next",
    "messages": [
      {
        "role": "user",
        "content": "What is the weather like in San Francisco and Paris?"
      },
      {
        "role": "assistant",
        "tool_calls": [
          {
            "id": "chatcmpl-tool-b43c4c23c8debc77",
            "type": "function",
            "function": {
              "name": "get_current_weather",
              "arguments": "{\"location\": \"San Francisco, CA\", \"unit\": \"fahrenheit\"}"
            }
          },
          {
            "id": "chatcmpl-tool-97b23e5873ddd6c9",
            "type": "function",
            "function": {
              "name": "get_current_weather",
              "arguments": "{\"location\": \"Paris\", \"unit\": \"celsius\"}"
            }
          }
        ]
      },
      {
        "role": "tool",
        "tool_call_id": "chatcmpl-tool-b43c4c23c8debc77",
        "content": "{\"temperature\": 72, \"unit\": \"fahrenheit\", \"description\": \"sunny\"}"
      },
      {
        "role": "tool",
        "tool_call_id": "chatcmpl-tool-97b23e5873ddd6c9",
        "content": "{\"temperature\": 15, \"unit\": \"celsius\", \"description\": \"cloudy\"}"
      }
    ]
  }' | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  2004  100   800  100  1204     48     73  0:00:16  0:00:16 --:--:--   189
{
  "id": "chatcmpl-a7d7e3fea16c9329",
  "object": "chat.completion",
  "created": 1770964548,
  "model": "Qwen3-Coder-Next",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Here’s the current weather:\n\n- **San Francisco, CA**: 🌞 **72°F (22°C)** — sunny  \n- **Paris**: ☁️ **15°C (59°F)** — cloudy  \n\nLet me know if you'd like a forecast for later today or any other details! 😊",
        "refusal": null,
        "annotations": null,
        "audio": null,
        "function_call": null,
        "tool_calls": [],
        "reasoning": null
      },
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null,
      "token_ids": null
    }
  ],
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "prompt_tokens": 148,
    "total_tokens": 215,
    "completion_tokens": 67,
    "prompt_tokens_details": null
  },
  "prompt_logprobs": null,
  "prompt_token_ids": null,
  "kv_transfer_params": null
}
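
The second curl request above replays the assistant's tool calls and appends one `tool` message per call. A minimal Python sketch of assembling that message list is shown here; the helper name `build_followup_messages` and the `results_by_id` mapping are hypothetical and only illustrate the payload shape used in the request, not any vLLM API.

```python
import json


def build_followup_messages(user_content, tool_calls, results_by_id):
    """Build the message list for the follow-up request.

    tool_calls is the list copied from the assistant message of the first
    response; results_by_id (hypothetical) maps each tool call id to a
    JSON-serializable result produced by the caller's own tool code.
    """
    messages = [
        {"role": "user", "content": user_content},
        {"role": "assistant", "tool_calls": tool_calls},
    ]
    # One "tool" message per call, linked back through tool_call_id.
    for call in tool_calls:
        messages.append(
            {
                "role": "tool",
                "tool_call_id": call["id"],
                "content": json.dumps(results_by_id[call["id"]]),
            }
        )
    return messages


# Values copied from the first response shown above.
tool_calls = [
    {
        "id": "chatcmpl-tool-b43c4c23c8debc77",
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "arguments": '{"location": "San Francisco, CA", "unit": "fahrenheit"}',
        },
    },
    {
        "id": "chatcmpl-tool-97b23e5873ddd6c9",
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "arguments": '{"location": "Paris", "unit": "celsius"}',
        },
    },
]
results = {
    "chatcmpl-tool-b43c4c23c8debc77": {
        "temperature": 72, "unit": "fahrenheit", "description": "sunny"
    },
    "chatcmpl-tool-97b23e5873ddd6c9": {
        "temperature": 15, "unit": "celsius", "description": "cloudy"
    },
}
messages = build_followup_messages(
    "What is the weather like in San Francisco and Paris?", tool_calls, results
)
payload = {"model": "Qwen3-Coder-Next", "messages": messages}
```

The resulting `payload` can be serialized with `json.dumps` and sent as the request body in place of the hand-written JSON.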

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@R3hankhan123
Contributor Author

@haosdent fyi

mergify Bot added the qwen (Related to Qwen models) and bug (Something isn't working) labels Feb 13, 2026
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request addresses an IndexError in the Qwen3CoderToolParser during streaming tool calls by adding necessary boundary checks before accessing self.streamed_args_for_tool. The fix appears correct in preventing the crash. However, the same checking logic has been duplicated in four separate locations within the extract_tool_calls_streaming method. My review includes a suggestion to refactor this duplicated code into a single helper method to improve the code's maintainability and reduce redundancy.
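
As a rough illustration of the refactor suggested in this review, the repeated bounds check could be folded into one helper. The sketch below is not code from the PR: it assumes the streamed arguments are kept in a plain list of strings indexed by tool-call position, and the function name `append_streamed_args` is hypothetical.

```python
def append_streamed_args(streamed_args: list[str], index: int, fragment: str) -> bool:
    """Append fragment to streamed_args[index] only when index is in bounds.

    Returns True if the fragment was recorded and False if the index was
    out of range, so the caller never triggers an IndexError.
    """
    if not 0 <= index < len(streamed_args):
        return False
    streamed_args[index] += fragment
    return True


# One tool call has been registered so far, with a partial JSON prefix.
streamed = ['{"location": ']
ok = append_streamed_args(streamed, 0, '"Paris"}')
# Index 1 does not exist yet, so the write is skipped instead of raising.
skipped = append_streamed_args(streamed, 1, "{}")
```

Each of the four call sites mentioned in the review would then call the helper rather than repeating the check inline.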

Comment thread vllm/tool_parsers/qwen3coder_tool_parser.py Outdated
@haosdent
Copy link
Copy Markdown
Contributor

Here is my local change. Feel free to use it as a reference and update the PR if needed.

diff --git a/tests/tool_parsers/test_qwen3coder_tool_parser.py b/tests/tool_parsers/test_qwen3coder_tool_parser.py
index 3d46f73de..b5611e930 100644
--- a/tests/tool_parsers/test_qwen3coder_tool_parser.py
+++ b/tests/tool_parsers/test_qwen3coder_tool_parser.py
@@ -40,7 +40,7 @@ def qwen3_xml_tool_parser(qwen3_tokenizer):
     return Qwen3XMLToolParser(qwen3_tokenizer)
 
 
-@pytest.fixture(params=["xml"])
+@pytest.fixture(params=["original", "xml"])
 def qwen3_tool_parser_parametrized(qwen3_tool_parser, qwen3_xml_tool_parser, request):
     """Parameterized fixture that provides both parser types for testing"""
     if request.param == "original":
@@ -668,6 +668,17 @@ def test_extract_tool_calls_streaming(
         expected_tool_calls
     )
 
+    # Verify streamed_args_for_tool contract with serving layer:
+    # len must match prev_tool_call_arr, and each entry must be valid JSON.
+    # This prevents IndexError when the serving layer accesses
+    # streamed_args_for_tool[index] at generation finish (issue #34322).
+    parser = qwen3_tool_parser_parametrized
+    assert len(parser.streamed_args_for_tool) == len(parser.prev_tool_call_arr)
+    for i in range(len(parser.prev_tool_call_arr)):
+        streamed = parser.streamed_args_for_tool[i]
+        assert streamed, f"streamed_args_for_tool[{i}] should not be empty"
+        json.loads(streamed)  # must be valid JSON
+
     # Verify each tool call
     for idx, expected_tool in enumerate(expected_tool_calls):
         state = tool_states[idx]
@@ -899,12 +910,14 @@ def test_extract_tool_calls_complex_type_with_single_quote(
 
 
 def test_extract_tool_calls_streaming_missing_opening_tag(
-    qwen3_tool_parser_parametrized, qwen3_tokenizer, sample_tools
+    qwen3_xml_tool_parser, qwen3_tokenizer, sample_tools
 ):
     """Test streaming with missing opening <tool_call> tag
 
-    This tests that the streaming parser correctly handles
-    tool calls that start directly with <function=...>
+    This tests that the XML streaming parser correctly handles
+    tool calls that start directly with <function=...>.
+    This is an XML parser-specific fallback; the original parser
+    requires the <tool_call> wrapper.
     """
     model_output = """I'll check the weather for you.
 
@@ -927,7 +940,7 @@ fahrenheit
     tool_states = {}
 
     for delta_message in stream_delta_message_generator(
-        qwen3_tool_parser_parametrized, qwen3_tokenizer, model_output, request
+        qwen3_xml_tool_parser, qwen3_tokenizer, model_output, request
     ):
         if delta_message.content:
             other_content += delta_message.content
@@ -963,7 +976,7 @@ fahrenheit
 
     # Verify we got the tool call
     assert len(tool_states) == 1
-    assert len(qwen3_tool_parser_parametrized.prev_tool_call_arr) == 1
+    assert len(qwen3_xml_tool_parser.prev_tool_call_arr) == 1
 
     state = tool_states[0]
     assert state["id"] is not None
diff --git a/vllm/tool_parsers/qwen3coder_tool_parser.py b/vllm/tool_parsers/qwen3coder_tool_parser.py
index a3c79f865..0b5590fb9 100644
--- a/vllm/tool_parsers/qwen3coder_tool_parser.py
+++ b/vllm/tool_parsers/qwen3coder_tool_parser.py
@@ -482,17 +482,13 @@ class Qwen3CoderToolParser(ToolParser):
                     # IMPORTANT: Add to prev_tool_call_arr immediately when
                     # we detect a tool call. This ensures
                     # finish_reason="tool_calls" even if parsing isn't complete
-                    already_added = any(
-                        tool.get("name") == self.current_function_name
-                        for tool in self.prev_tool_call_arr
+                    self.prev_tool_call_arr.append(
+                        {
+                            "name": self.current_function_name,
+                            "arguments": {},  # Placeholder, will be updated later
+                        }
                     )
-                    if not already_added:
-                        self.prev_tool_call_arr.append(
-                            {
-                                "name": self.current_function_name,
-                                "arguments": "{}",  # Placeholder, will be updated later
-                            }
-                        )
+                    self.streamed_args_for_tool.append("")
 
                     # Send header with function info
                     return DeltaMessage(
@@ -514,6 +510,7 @@ class Qwen3CoderToolParser(ToolParser):
             # Send opening brace if not sent yet
             if not self.json_started and self.parameter_prefix not in delta_text:
                 self.json_started = True
+                self.streamed_args_for_tool[self.current_tool_index] += "{"
                 return DeltaMessage(
                     tool_calls=[
                         DeltaToolCall(
@@ -550,16 +547,14 @@ class Qwen3CoderToolParser(ToolParser):
                             else None,
                         )
                         if parsed_tool:
-                            # Update existing entry in
-                            # prev_tool_call_arr with complete args
-                            for i, tool in enumerate(self.prev_tool_call_arr):
-                                if tool.get("name") == parsed_tool.function.name:
-                                    args = parsed_tool.function.arguments
-                                    self.prev_tool_call_arr[i]["arguments"] = args
-                                    break
+                            # Update current tool entry with complete args
+                            self.prev_tool_call_arr[self.current_tool_index][
+                                "arguments"
+                            ] = json.loads(parsed_tool.function.arguments)
                     except Exception:
                         pass  # Ignore parsing errors during streaming
 
+                self.streamed_args_for_tool[self.current_tool_index] += "}"
                 result = DeltaMessage(
                     tool_calls=[
                         DeltaToolCall(
@@ -675,6 +670,9 @@ class Qwen3CoderToolParser(ToolParser):
                             )
 
                         self.param_count += 1
+                        self.streamed_args_for_tool[self.current_tool_index] += (
+                            json_fragment
+                        )
 
                         return DeltaMessage(
                             tool_calls=[

Compared to your pull request, this change:

  • Removes the faulty already_added dedup, so parallel tool calls with the same name each get their own entry
  • Stores arguments as a dict (not a JSON string), matching the serving layer's json.dumps() contract
  • Updates prev_tool_call_arr by current_tool_index instead of searching by name
  • Adds "original" to the parametrized test fixture so the original parser is actually tested in streaming mode
  • Adds streamed_args_for_tool contract assertions
  • Restricts the missing-opening-tag streaming test to the XML parser, since the original parser requires the <tool_call> wrapper
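
To make the first three points concrete, here is a small standalone sketch. It is not vLLM code: the helper names track_with_name_dedup and track_by_index are hypothetical, and the lists only mimic prev_tool_call_arr and streamed_args_for_tool from the diff above. It assumes two parallel calls to the same function and shows how name-based dedup leaves one entry where two are needed, and why a JSON-string placeholder gets encoded twice by json.dumps().

```python
import json


def track_with_name_dedup(function_names):
    """Simplified model of the old logic: skip the append when a tool
    with the same name is already tracked."""
    prev_tool_call_arr = []
    for name in function_names:
        already_added = any(
            tool.get("name") == name for tool in prev_tool_call_arr
        )
        if not already_added:
            # String placeholder, as in the removed lines of the diff
            prev_tool_call_arr.append({"name": name, "arguments": "{}"})
    return prev_tool_call_arr


def track_by_index(function_names):
    """Simplified model of the suggested logic: one entry per detected
    call, with the streamed-arguments buffer kept in lockstep."""
    prev_tool_call_arr = []
    streamed_args_for_tool = []
    for name in function_names:
        # Dict placeholder, as in the added lines of the diff
        prev_tool_call_arr.append({"name": name, "arguments": {}})
        streamed_args_for_tool.append("")
    return prev_tool_call_arr, streamed_args_for_tool


# Two parallel calls to the same (hypothetical) tool
calls = ["get_weather", "get_weather"]

deduped = track_with_name_dedup(calls)
tracked, streamed = track_by_index(calls)

print(len(deduped))                 # 1: the second call was dropped
print(len(tracked), len(streamed))  # 2 2: both lists stay aligned

# A JSON string placeholder is encoded a second time by json.dumps(),
# while a dict placeholder serializes to a plain JSON object.
as_string = json.dumps("{}")
as_dict = json.dumps({})
print(as_string)  # "{}"  (a quoted JSON string)
print(as_dict)    # {}    (an empty JSON object)
```

With the dedup in place, any consumer that indexes per tool call (for example index 1 for the second call) would find only one tracked entry, which is the kind of mismatch the new test assertions are meant to catch.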

@mergify
Contributor

mergify Bot commented Mar 20, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @R3hankhan123.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label Mar 20, 2026
@mergify mergify Bot removed the needs-rebase label Apr 29, 2026
@mergify
Contributor

mergify Bot commented Apr 29, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @R3hankhan123.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify
Contributor

mergify Bot commented May 14, 2026

Hi @R3hankhan123, the pre-commit checks have failed. Please run:

uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

When streaming tool calls, the parser attempted to access `arguments[func_start:]`
before validating that `func_start` is within bounds. This caused an IndexError
when the function name appeared at the very end of the streamed chunk.

Signed-off-by: Rehan Khan <Rehan.Khan7@ibm.com>

Labels

bug (Something isn't working), qwen (Related to Qwen models), tool-calling

Development

Successfully merging this pull request may close these issues.

[Bug]: IndexError: list index out of range when using the Qwen3-Coder-Next model with the qwen3_coder tool parser

2 participants